1 Abstract

2 Introduction

3 Background

4 Descriptive analysis

4.1 Overview

In line with the goal of this analysis, there are various metrics that one can use to understand the severity of alcohol-related vehicle fatalities throughout the US. These include the number of alcohol-related fatalities per state, the rate of alcohol-related fatalities per 10k people, and the ratio of alcohol-related fatalities to overall fatalities. Each metric produces different results and supports a different interpretation.

To get a sense of how the choice of metric can affect the interpretation of the outcome, we observe the map below, which shades each US state in purple according to its mean proportion of alcohol-related fatalities. Hovering over the map shows that California (CA) has one of the largest numbers of alcohol-related fatalities in the country (averaging more than 5000 deaths per year), yet its rate of 1.9 alcohol-related fatalities per 10k people is almost half that of Texas (TX), at 3.6 per 10k people. On the other hand, if we compare the ratio of alcohol-related fatalities to overall fatalities, both California (0.26) and Texas (0.41) are far below Mississippi (0.52).

These differences show the importance of choosing a suitable metric for the purpose of this data analysis. The choice of metric will be discussed in Section 4.2.1.2.
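The three metrics discussed in this section can be sketched in a few lines of base R. The data frame `toy`, its column names, and all counts below are hypothetical and chosen only for illustration; they are not taken from the Fatalities dataset.

```r
# Toy data: hypothetical counts, not taken from the Fatalities dataset.
toy <- data.frame(
  state  = c("A", "B", "C"),
  fatal  = c(5000, 3000, 700),    # overall traffic fatalities
  afatal = c(1300, 1200, 360),    # alcohol-related fatalities
  pop    = c(26e6, 16e6, 2.6e6)   # resident population
)

# Metric 1: raw count of alcohol-related fatalities (already in `afatal`).
# Metric 2: alcohol-related fatalities per 10,000 residents.
toy$afatal_per_10k <- toy$afatal / toy$pop * 1e4
# Metric 3: share of all fatalities that are alcohol-related.
toy$afatal_share <- toy$afatal / toy$fatal

# Rank the states under each metric; rank 1 is the most severe.
ranks <- data.frame(
  state      = toy$state,
  by_count   = rank(-toy$afatal),
  by_per_10k = rank(-toy$afatal_per_10k),
  by_share   = rank(-toy$afatal_share)
)
print(ranks)
```

In this toy example the state with the largest count is ranked last under the other two metrics, which mirrors the kind of reversal described for California above.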

4.2 Exploratory Data Analysis

As the focus of this data analysis is to find out whether the laws implemented to tackle drunk-driving fatalities were effective, only a subset of the variables from the Fatalities dataset was used. In particular, we examined the response variables relevant to alcohol-related fatalities, namely the total number of fatalities and the number of alcohol-related fatalities, and analyzed the predictor variables that are closely related to alcohol-consumption-driven laws.

4.2.1 Univariate Analysis

We start off the exploratory data analysis procedure by individually examining the predictor and response variables. The goal here is to understand how the data is distributed, which helps set an expectation on how the variables correlate with each other, or whether model assumptions will be met.

4.2.1.1 Predictor Variables

As we are looking into how alcohol-consumption-driven laws impact the rate of alcohol-related fatalities, the variables of interest include spirits consumption, beer tax, proportion of the population living in dry counties, minimum drinking age, and the mandatory punishments implemented by each state throughout the 7 years.

The plot below shows the top 5 states in terms of average spirits consumption, average beer tax, and average proportion of population living in dry counties between 1982 and 1988. Apart from North Carolina (NC), which is in the top 5 for both beer tax and proportion of dry residents, there is no “standout” state, i.e. no other state appears in more than one of the top 5 categories.

On the other hand, it can be seen that laws were increasingly implemented or tightened throughout the 7 years. The most obvious change is the number of states that increased the minimum drinking age. In 1982, almost half the country had set a minimum drinking age below 21, yet by 1988 most states had opted for 21 as the minimum drinking age.

Additionally, there seems to be a slight increasing trend in the number of states that implemented testing (preliminary breath tests) and punishments (mandatory jail sentences and mandatory community service) between 1982 and 1988. We need to note, however, that the number of states implementing mandatory jail sentences decreased very slightly from 1986 to 1988. This raises the question of whether a mandatory jail sentence is effective in combating drunk driving. Such questions will be addressed after fitting a suitable model.

4.2.1.2 Response Variables

After observing the trend of the implemented laws, the focus now switches to the distribution of fatalities and alcohol-related fatalities across the country. A quick look at the top two histograms below might suggest that a large portion of states have fewer than 1000 fatalities per year, and fewer than 500 alcohol-related fatalities per year. However, each state’s population needs to be taken into account because population sizes vary significantly across the country. The population-adjusted histograms (bottom two) tell us that the distributions of the data are approximately normal.

Since the goal of this analysis is to discuss the effects of alcohol-related laws, alcohol-related fatalities become our main topic of interest. There are a number of candidate metrics for such fatalities, among which is the number of alcohol-related fatalities per 10k people. However, the issue with this metric is that it does not tell us clearly whether the implemented traffic policies succeeded in reducing the number of alcohol-related fatalities. Other factors could have lowered the overall fatality rate in general, which in turn affects the alcohol-related fatality rate.

In this analysis, the metric used for analyzing the effect of traffic laws on alcohol-related fatalities is the proportion of alcohol-related fatalities among the overall fatalities by state and year, i.e. \[ p = \frac{\text{Number of alcohol-related fatalities}}{\text{Number of overall fatalities}} \] An advantage of this metric for the purpose of this data analysis is its robustness to changes in overall fatalities. In other words, \(p\) remains informative about alcohol-related fatalities even when overall fatalities change in certain states or years.
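As a worked illustration with hypothetical numbers (not taken from the dataset), suppose a state with 5 million residents records 1000 fatalities in one year, 400 of them alcohol-related, and that in the following year all fatalities fall by 10% for reasons unrelated to alcohol policy. The proportion is unchanged, \[ p_1 = \frac{400}{1000} = 0.40, \qquad p_2 = \frac{360}{900} = 0.40, \] whereas the rate of alcohol-related fatalities per 10k people moves from \[ \frac{400}{5{,}000{,}000} \times 10^4 = 0.80 \quad \text{to} \quad \frac{360}{5{,}000{,}000} \times 10^4 = 0.72. \] Under these assumed numbers, the per-capita rate would suggest an improvement specific to drunk driving, while \(p\) correctly shows that the alcohol-related share did not change.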

The plot below shows the proportion of alcohol-related fatalities of each state throughout the years. It can be seen that the proportion of alcohol-related fatalities has been either constant or decreasing in those 7 years. This is most evident in states such as Kansas (KS), North Dakota (ND), and Arkansas (AR). However, there is one exception to this trend. In the line plot below, we observe that Mississippi had a significant increase in the proportion of alcohol-related fatalities from 1983 to 1988.

4.2.2 Multivariate Analysis

In this section, we examine the pairwise relationships between the predictor and response variables. We first examine the correlation between each pair of continuous predictor variables. The scatterplot matrix below shows no distinct patterns between the predictor variables in any year, which suggests low correlation between all pairs of continuous predictors.

Another thing we want to check prior to fitting any models in this analysis is the absence of multicollinearity, which we assess using the variance inflation factor (VIF). The VIF of the \(k\)th predictor, denoted as \(VIF_k\), is defined as \[ VIF_k = \frac{1}{1-R_k^2} \] where \(R_k^2\) is the coefficient of multiple determination when the predictor variable \(X_k\) is regressed onto the rest of the \(X\) variables. Intuitively, a large \(VIF_k\) value means that the predictor \(X_k\) can be well explained by the other \(X\) variables, which indicates multicollinearity. Notice that \(0 \leq R_k^2 < 1\), and therefore \(VIF_k \geq 1\). This means that we want \(VIF\) values that are as close to 1 as possible.

In the table below, we see that all the \(VIF_k\) values are close to 1, signifying that multicollinearity is not an issue in this data set.
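The VIF definition above can be checked directly with base R. The following is a minimal sketch on simulated data; the predictors `x1`, `x2`, `x3` and the helper `vif_from_definition()` are hypothetical and are not the variables or code used for the table. The `car` package listed in the session info provides a `vif()` function for fitted models that could serve the same purpose.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # deliberately collinear with x1
X  <- data.frame(x1, x2, x3)

# VIF_k = 1 / (1 - R_k^2), where R_k^2 comes from regressing X_k on the others.
vif_from_definition <- function(X) {
  sapply(names(X), function(k) {
    fit <- lm(reformulate(setdiff(names(X), k), response = k), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_from_definition(X)
print(round(vifs, 2))
```

In this simulation `x1` and `x3` should show clearly inflated values, while `x2`, which is generated independently, should stay close to 1.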

With the predictor variables analyzed, we now proceed to the pairwise relationships between those predictor variables and the proportion of alcohol-related fatalities. In the series of boxplots below, we gain some insights that one would generally expect:

  • The proportion of alcohol-related fatalities shows a decreasing trend throughout the years.
  • The proportion of alcohol-related fatalities is slightly lower when mandatory community service is in place.
  • The proportion of alcohol-related fatalities is slightly lower when a preliminary breath test is in place.
  • The proportion of alcohol-related fatalities decreases as the minimum drinking age increases. However, note that the boxplot of alcohol-related fatalities when the minimum drinking age is 21 has a noticeably larger spread.

On the other hand, an interesting finding from these boxplots is that mandatory jail sentences are associated with a larger proportion of alcohol-related fatalities. This unexpected finding could also be the reason that the number of states implementing such a policy decreased between 1986 and 1988, as shown in the previous section.

After observing the change in the proportion of alcohol-related fatalities in response to categorical variables and time, we then aim to do the same with the continuous predictor variables. In the interactive scatter plots below, each data point is colored by whether a mandatory jail sentence was in place, because of the unexpected finding in the previous paragraph. While no concrete conclusions can be drawn in that regard, it can be seen that the data points in all four plots converge to the bottom left corner as the years go by. This tells us that:

  • The proportion of alcohol-related fatalities has decreased over the years, except for Mississippi. After 1985, Mississippi’s proportion of alcohol-related fatalities kept increasing, while the proportion for other states continued to decrease. By 1987 and 1988, Mississippi was the outlier in this data.
  • With the exception of the proportion of dry residents, the continuous predictors (beer tax, unemployment rate, and spirits consumption) generally showed a decreasing trend across states throughout the 7 years.

Based on what we have seen so far, it does seem as if these policies had a beneficial effect in the long run. It may seem counterintuitive that a decreasing beer tax coincides with a decreasing proportion of alcohol-related fatalities. One possible explanation is a time lag: people may take time to adjust to high beer taxes before reducing their alcohol purchases and intake, which ultimately results in fewer alcohol-related fatalities.

Having covered the main relationships, we now turn to the distribution of young drivers (aged 15 to 24) throughout the years. While this is not directly related to the main question of this analysis, it is interesting to see whether the change in the minimum drinking age is associated with any change in the proportion of young drivers. As expected, with the increase in the minimum drinking age, the proportion of young drivers decreased. This could be due to the decrease in the proportion of legal young drivers.

With all of this in mind, we can move on to fitting an appropriate model and drawing some causal inferences from this data.

5 Inferential analysis

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## unemp      -0.0567273  0.0071020 -7.9875 3.715e-14 ***
## jailyes    -0.0075778  0.1245342 -0.0608 0.9515232    
## drinkage19 -0.0421462  0.0662733 -0.6359 0.5253372    
## drinkage20 -0.0223700  0.0726344 -0.3080 0.7583287    
## drinkage21  0.0520453  0.0696946  0.7468 0.4558406    
## beertax    -0.5877703  0.1716524 -3.4242 0.0007099 ***
## spirits     0.6944051  0.0802470  8.6533 4.154e-16 ***
## dry         0.0279848  0.0134966  2.0735 0.0390522 *  
## serviceyes  0.0361896  0.1439526  0.2514 0.8016915    
## breathyes   0.0206776  0.0506726  0.4081 0.6835426    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = fatal_r ~ unemp + jail + drinkage + beertax + spirits + 
##     dry + service + breath, data = data, model = "within", index = c("state", 
##     "year"))
## 
## Unbalanced Panel: n = 48, T = 6-7, N = 335
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -0.4718637 -0.0788562  0.0018826  0.0794093  0.6230710 
## 
## Coefficients:
##              Estimate Std. Error t-value  Pr(>|t|)    
## unemp      -0.0567273  0.0071020 -7.9875 3.715e-14 ***
## jailyes    -0.0075778  0.1245342 -0.0608 0.9515232    
## drinkage19 -0.0421462  0.0662733 -0.6359 0.5253372    
## drinkage20 -0.0223700  0.0726344 -0.3080 0.7583287    
## drinkage21  0.0520453  0.0696946  0.7468 0.4558406    
## beertax    -0.5877703  0.1716524 -3.4242 0.0007099 ***
## spirits     0.6944051  0.0802470  8.6533 4.154e-16 ***
## dry         0.0279848  0.0134966  2.0735 0.0390522 *  
## serviceyes  0.0361896  0.1439526  0.2514 0.8016915    
## breathyes   0.0206776  0.0506726  0.4081 0.6835426    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    10.785
## Residual Sum of Squares: 7.3356
## R-Squared:      0.31982
## Adj. R-Squared: 0.17986
## F-statistic: 13.0247 on 10 and 277 DF, p-value: < 2.22e-16
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## unemp      -0.0133817  0.0055587 -2.4073   0.01672 *  
## jailyes     0.2039941  0.0974724  2.0928   0.03727 *  
## drinkage19  0.0370111  0.0518718  0.7135   0.47613    
## drinkage20  0.0122483  0.0568506  0.2154   0.82958    
## drinkage21  0.0302554  0.0545496  0.5546   0.57959    
## beertax    -0.2039728  0.1343515 -1.5182   0.13010    
## spirits     0.3989548  0.0628089  6.3519 8.692e-10 ***
## dry         0.0048993  0.0105637  0.4638   0.64316    
## serviceyes -0.1983558  0.1126710 -1.7605   0.07943 .  
## breathyes  -0.0152399  0.0396612 -0.3843   0.70109    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = afatal_r ~ unemp + jail + drinkage + beertax + 
##     spirits + dry + service + breath, data = data, model = "within", 
##     index = c("state", "year"))
## 
## Unbalanced Panel: n = 48, T = 6-7, N = 335
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -0.9489734 -0.0571550 -0.0073183  0.0504910  0.4753808 
## 
## Coefficients:
##              Estimate Std. Error t-value  Pr(>|t|)    
## unemp      -0.0133817  0.0055587 -2.4073   0.01672 *  
## jailyes     0.2039941  0.0974724  2.0928   0.03727 *  
## drinkage19  0.0370111  0.0518718  0.7135   0.47613    
## drinkage20  0.0122483  0.0568506  0.2154   0.82958    
## drinkage21  0.0302554  0.0545496  0.5546   0.57959    
## beertax    -0.2039728  0.1343515 -1.5182   0.13010    
## spirits     0.3989548  0.0628089  6.3519 8.692e-10 ***
## dry         0.0048993  0.0105637  0.4638   0.64316    
## serviceyes -0.1983558  0.1126710 -1.7605   0.07943 .  
## breathyes  -0.0152399  0.0396612 -0.3843   0.70109    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    5.3854
## Residual Sum of Squares: 4.4939
## R-Squared:      0.16555
## Adj. R-Squared: -0.0061595
## F-statistic: 5.49553 on 10 and 277 DF, p-value: 1.8671e-07
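
The "Within Model" label in the output above refers to the fixed effects (within) estimator. As a rough sketch of what that estimator does, the base R code below demeans each variable by state and shows that the resulting slope matches a regression with state dummies. The data are simulated, and the names `toy`, `alpha`, `x`, and `y` are hypothetical; this is not the model or data reported above.

```r
set.seed(42)
states <- rep(c("A", "B", "C", "D"), each = 7)
year   <- rep(1982:1988, times = 4)
alpha  <- rep(c(1, 2, 3, 4), each = 7)        # unobserved state effect
x      <- alpha + rnorm(length(states))       # regressor correlated with alpha
y      <- 0.5 * x + alpha + rnorm(length(states), sd = 0.1)
toy    <- data.frame(state = states, year, x, y)

# Within transformation: subtract each state's own mean.
toy$x_dm <- toy$x - ave(toy$x, toy$state)
toy$y_dm <- toy$y - ave(toy$y, toy$state)
beta_within <- coef(lm(y_dm ~ x_dm - 1, data = toy))[["x_dm"]]

# Equivalent least-squares dummy-variable fit.
beta_lsdv <- coef(lm(y ~ x + factor(state), data = toy))[["x"]]

# Pooled OLS ignores the state effect, so its slope absorbs part of alpha.
beta_pooled <- coef(lm(y ~ x, data = toy))[["x"]]

print(c(within = beta_within, lsdv = beta_lsdv, pooled = beta_pooled))
```

The within and dummy-variable slopes agree up to numerical precision and sit near the simulated value of 0.5, which illustrates why time-invariant state characteristics do not confound the fixed effects estimates.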

6 Sensitivity analysis

7 Causal interpretation

In panel data, measurements are taken on the same entity (state) repeatedly at different time points (years). Also, the fixed effect model adopted in the current project accounts for unobserved, entity-specific, time-invariant confounders. Given these features, it might seem reasonable to make causal inference for significant predictors on the response variable. However, a fixed effect model requires strong exogeneity assumptions in order to make causal inference, including: (a) no unobserved time-varying confounders; (b) past outcomes do not directly affect the current outcome; (c) past treatments do not directly affect the current outcome; (d) past outcomes do not directly affect the current treatment (no reverse causation).

Assumption (a) is hard to verify and also difficult to relax under the fixed effect model. Thus we assumed that no time-varying confounders were omitted from the current model, and examined whether the other assumptions were violated and how they can be relaxed. Assumption (b) can be relaxed without interfering with the causal inference between current treatment and current outcome, so long as we condition on past treatment and assume that the past outcome does not directly affect the current treatment. To relax assumption (c), we could add a small number of lagged treatment effects to the model (e.g. treatment from the year before). Last, for assumption (d), no reverse causation, a popular approach to relaxing it is to include instrumental variables for endogenous predictors. Endogenous predictors are those that are included in the model but are correlated with the error term. This can happen when the response variable causes the predictor, or when some omitted confounders affect both dependent and independent variables. Instrumental variables are those that are not included in the model and are associated with the endogenous predictor, but not with the unobserved confounders.
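The lagged treatment mentioned for assumption (c) can be constructed within each state before refitting the model. The following is a minimal base R sketch; the data frame `toy`, its values, and the helper `lag1()` are hypothetical and only illustrate the mechanics.

```r
toy <- data.frame(
  state   = rep(c("A", "B"), each = 4),
  year    = rep(1982:1985, times = 2),
  beertax = c(0.50, 0.48, 0.45, 0.45, 1.20, 1.15, 1.10, 1.00)
)

# Sort so that lagging within a state moves back exactly one year.
toy <- toy[order(toy$state, toy$year), ]

# One-year lag within each state; the first year of each state has no lag.
lag1 <- function(v) c(NA, head(v, -1))
toy$beertax_lag1 <- ave(toy$beertax, toy$state, FUN = lag1)

print(toy)
```

Each state loses its first year when the lagged term enters the model, which matters here because the panel only spans 7 years. The sketch also assumes there are no gaps in the yearly sequence within a state.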

Some previous studies on the traffic policy environment and fatality rate suggested using alcohol regulations as instrumental variables for alcohol consumption when investigating the effect of alcohol consumption on traffic accident fatalities. The argument is that such alcohol regulations can only affect traffic accident fatalities through alcohol consumption, and previous studies have shown a significant effect of such regulations on alcohol consumption. In the current dataset, the covariate related to alcohol consumption is “spirits”, and alcohol regulations include “drinkage” (minimum drinking age) and “beertax”. To verify the appropriateness of drinkage and beertax as instrumental variables for spirits consumption, under-identification, weak instruments, and over-identification need to be tested. Testing for under-identification means testing the null hypothesis that the candidate instrument (beertax or drinkage) is unrelated to spirits. This can be done through a simple t-test or a likelihood ratio test. The result showed that beertax was not associated with spirits consumption (Pr(>F) = 0.1012), but drinkage had a significant effect (Pr(>F) < 0.0001). Thus, beertax failed the under-identification test. Weak instruments were tested by calculating the Cragg-Donald F statistic and comparing it against the Stock and Yogo critical values. The null hypothesis (the instrumental variables are weak) can be rejected if the Cragg-Donald F statistic is greater than the critical value. The Cragg-Donald F statistic calculated for drinkage was 10.59, and the critical value was 22.3, thus we failed to reject the null at significance level 0.05. As a result, we could not find appropriate instrumental variables for spirits in the current dataset. If more measures were available, such as other alcohol regulations and other alcohol consumption information, we might be able to find more suitable instrumental variables.
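The relevance check described above amounts to an F-test on the instrument in a first-stage regression of the endogenous predictor. The base R sketch below shows the mechanics on simulated data; the variables `z`, `w`, and `x` are hypothetical stand-ins, and the code is not the procedure that produced the reported statistics. With a single endogenous regressor and homoskedastic errors, the Cragg-Donald statistic is generally understood to coincide with this first-stage F statistic.

```r
set.seed(7)
n <- 300
z <- rnorm(n)                       # candidate instrument (hypothetical)
w <- rnorm(n)                       # exogenous control (hypothetical)
x <- 0.15 * z + 0.5 * w + rnorm(n)  # endogenous regressor, weakly tied to z

# First stage with and without the instrument.
fs_full       <- lm(x ~ z + w)
fs_restricted <- lm(x ~ w)

# F statistic for excluding the instrument from the first stage.
first_stage_F <- anova(fs_restricted, fs_full)$F[2]

# With one instrument, this F equals the squared t statistic on z.
t_z <- summary(fs_full)$coefficients["z", "t value"]
print(c(F = first_stage_F, t_squared = t_z^2))
```

The resulting F statistic would then be compared against the relevant Stock and Yogo critical value, as was done for drinkage above.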

8 Discussion

Acknowledgement

Reference

Session info

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] plm_2.4-0       gplots_3.1.1    panelr_0.7.5    lme4_1.1-25    
##  [5] Matrix_1.2-18   GGally_2.0.0    forcats_0.5.0   stringr_1.4.0  
##  [9] dplyr_1.0.2     purrr_0.3.4     readr_1.4.0     tidyr_1.1.2    
## [13] tibble_3.0.4    tidyverse_1.3.0 plotly_4.9.2.1  ggplot2_3.3.2  
## [17] AER_1.2-9       survival_3.2-7  sandwich_3.0-0  lmtest_0.9-38  
## [21] zoo_1.8-8       car_3.0-10      carData_3.0-4  
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-150       bitops_1.0-6       fs_1.5.0           lubridate_1.7.9.2 
##  [5] RColorBrewer_1.1-2 httr_1.4.2         tools_4.0.3        backports_1.2.0   
##  [9] R6_2.5.0           KernSmooth_2.23-18 DBI_1.1.0          lazyeval_0.2.2    
## [13] colorspace_2.0-0   withr_2.3.0        tidyselect_1.1.0   gridExtra_2.3     
## [17] curl_4.3           compiler_4.0.3     cli_2.2.0          rvest_0.3.6       
## [21] xml2_1.3.2         labeling_0.4.2     caTools_1.18.0     scales_1.1.1      
## [25] digest_0.6.27      minqa_1.2.4        foreign_0.8-80     rmarkdown_2.5     
## [29] rio_0.5.16         pkgconfig_2.0.3    htmltools_0.5.0    dbplyr_2.0.0      
## [33] htmlwidgets_1.5.2  rlang_0.4.8        readxl_1.3.1       rstudioapi_0.13   
## [37] generics_0.1.0     farver_2.0.3       jsonlite_1.7.1     gtools_3.8.2      
## [41] crosstalk_1.1.0.1  zip_2.1.1          magrittr_2.0.1     Formula_1.2-4     
## [45] Rcpp_1.0.5         munsell_0.5.0      fansi_0.4.2        abind_1.4-5       
## [49] lifecycle_0.2.0    stringi_1.5.3      yaml_2.2.1         gbRd_0.4-11       
## [53] MASS_7.3-53        plyr_1.8.6         bdsmatrix_1.3-4    crayon_1.3.4      
## [57] lattice_0.20-41    haven_2.3.1        splines_4.0.3      pander_0.6.3      
## [61] jtools_2.1.2       hms_0.5.3          knitr_1.30         pillar_1.4.7      
## [65] boot_1.3-25        reprex_1.0.0       glue_1.4.2         evaluate_0.14     
## [69] data.table_1.13.2  modelr_0.1.8       Rdpack_2.1         nloptr_1.2.2.2    
## [73] vctrs_0.3.5        miscTools_0.6-26   cellranger_1.1.0   gtable_0.3.0      
## [77] reshape_0.8.8      assertthat_0.2.1   xfun_0.19          openxlsx_4.2.3    
## [81] rbibutils_2.0      broom_0.7.2        viridisLite_0.3.0  maxLik_1.4-6      
## [85] statmod_1.4.35     ellipsis_0.3.1